
    AutoParallel: A Python module for automatic parallelization and distributed execution of affine loop nests

    Recent improvements in programming languages, programming models, and frameworks have focused on shielding users from many programming concerns. Among other features, recent programming frameworks offer simpler syntax, automatic memory management and garbage collection, easy code reuse through library packages, and easily configurable deployment tools. For instance, Python has risen to the top of the list of programming languages thanks to the simplicity of its syntax, while still achieving good performance even though it is an interpreted language. Moreover, the community has developed a large number of libraries and modules, tuning them to obtain high performance. However, there is still room for improvement in shielding users from dealing directly with distributed- and parallel-computing issues. This paper proposes and evaluates AutoParallel, a Python module that automatically finds an appropriate task-based parallelization of affine loop nests and executes them in parallel on a distributed computing infrastructure. The parallelization can also build data blocks that increase task granularity in order to achieve good execution performance. Moreover, AutoParallel is based on sequential programming and requires only a small annotation, in the form of a Python decorator, so that anyone with basic programming skills can scale an application up to hundreds of cores. Comment: accepted at the 8th Workshop on Python for High-Performance and Scientific Computing (PyHPC 2018).
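
As a rough illustration of the idea only (the `blocked` decorator below is hypothetical, not AutoParallel's actual API), a decorator can split an affine matrix-multiply loop nest into blocks and run each block as an independent task:

```python
from concurrent.futures import ThreadPoolExecutor

def blocked(bs):
    """Hypothetical stand-in for an AutoParallel-style decorator:
    runs each (bi, bj) block of the annotated loop nest as a task."""
    def wrap(kernel):
        def run(n, *arrays):
            blocks = [(i, min(i + bs, n)) for i in range(0, n, bs)]
            with ThreadPoolExecutor() as pool:
                futs = [pool.submit(kernel, lo_i, hi_i, lo_j, hi_j, n, *arrays)
                        for (lo_i, hi_i) in blocks for (lo_j, hi_j) in blocks]
                for f in futs:
                    f.result()  # propagate any exception from a task
        return run
    return wrap

@blocked(bs=2)
def matmul_block(lo_i, hi_i, lo_j, hi_j, n, a, b, c):
    # affine loop nest over one block: c[i][j] += a[i][k] * b[k][j]
    for i in range(lo_i, hi_i):
        for j in range(lo_j, hi_j):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
```

Each task writes a disjoint block of `c`, so the tasks can run concurrently without synchronization; a real system would additionally ship blocks to remote workers.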

    Xfor: Semantics and Performance

    This paper introduces a new programming control structure called "xfor", an extension of the classical "for" construct in C. It is designed to help programmers improve data locality on multi-core architectures by letting them express the schedule of instructions in an abstract way. This schedule is defined geometrically by mapping the iteration domains relative to each other onto a common referential, using specific parameters called the grain and the offset. A semantic framework is presented which associates a precise meaning with this syntactic construct and serves as a basis for applying reliable xfor code transformations and programming strategies. These issues are illustrated with the Red-Black algorithm. Performance measurements carried out on benchmark programs rewritten with the xfor construct show significant execution-time speed-ups.
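
The offset mechanism can be sketched in a few lines of Python (a simplification with grain fixed to 1, not the actual C construct): each body's iteration domain is shifted onto a shared referential, and at each global index every body whose shifted domain covers that index executes.

```python
def xfor_like(n, offsets, bodies):
    """Sketch of xfor-style scheduling with grain = 1: body b's domain
    [0, n) is mapped onto the common referential shifted by offsets[b];
    at global index g, body b runs its local iteration g - offsets[b]."""
    span = n + max(offsets)
    for g in range(span):
        for off, body in zip(offsets, bodies):
            i = g - off
            if 0 <= i < n:
                body(i)
```

Giving a consumer an offset of 1 interleaves it one step behind its producer, so each value is read right after it is written, which is exactly the locality effect the construct targets.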

    Algebraic Tiling

    In this paper, we present ongoing work whose aim is to propose a new loop tiling technique where tiles are characterized by their volumes (the number of embedded iterations) instead of their sizes (the lengths of their edges). Tiles of quasi-equal volumes are dynamically generated while the tiled loops are running, whatever the original loop bounds, which may be constant or depend linearly on surrounding loop iterators. The adopted strategy is to successively and hierarchically slice the iteration domain into parts of quasi-equal volumes, from the outermost to the innermost loop dimensions. Since the number of such slices can be chosen exactly, quasi-perfect load balancing is reached by choosing, for each parallel loop, the number of slices equal to the number of parallel threads, or to a multiple of this number. Moreover, the approach avoids partial tiles by construction, thus yielding a perfect covering of the iteration domain that minimizes the loop control cost. Finally, algebraic tiling makes dynamic scheduling of the parallel threads fairly purposeless for the handled parallel tiled loops.
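
A toy version of the outermost slicing step (a sketch, not the paper's algorithm, which works symbolically on the loop bounds): for the triangular domain where the inner bound depends linearly on the outer iterator, cut the outer loop where the cumulative iteration count crosses equal fractions of the total volume, rather than at equal widths.

```python
def equal_volume_slices(n, p):
    """Slice the triangular domain {(i, j) : 0 <= j <= i < n} along i
    into p outer slices of quasi-equal volume (iteration count)."""
    total = n * (n + 1) // 2
    bounds, acc, target, k = [0], 0, total / p, 1
    for i in range(n):
        acc += i + 1                      # row i contains i + 1 iterations
        if acc >= k * target and k < p:   # crossed the k-th volume fraction
            bounds.append(i + 1)
            k += 1
    bounds.append(n)
    return [(bounds[s], bounds[s + 1]) for s in range(p)]
```

With p equal to the thread count, each thread receives a slice of roughly `total / p` iterations even though the slices have very different widths.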

    The Polyhedral Model Beyond Loops: Recursion Optimization and Parallelization Through Polyhedral Modeling

    There may be a huge gap between the statements written by programmers in a program's source code and the instructions actually performed by a given processor architecture when running the executable code. This gap is due to the way the input code has been interpreted, translated, and transformed by the compiler and the final processor hardware. Thus, there is an opportunity for efficient optimization strategies dedicated to specific control structures and memory access patterns to apply as soon as the actual runtime behavior has been discovered, even if they could not have been applied to the original source code. In this paper, we develop this idea by identifying code extracts that behave as polyhedral-compliant loops at runtime, while not having been written as loops at all in the original source code. In particular, we are interested in recursive functions whose runtime behavior can be modeled as polyhedral loops. Therefore, the scope of this study exclusively includes recursive functions whose control flow and memory accesses exhibit an affine behavior, which means that there exists a semantically equivalent affine loop nest, a candidate for polyhedral optimizations. Accordingly, our approach is based on analyzing early executions of a recursive program using a Nested Loop Recognition (NLR) algorithm, performing the affine loop modeling of the original program's runtime behavior, which is then used to generate an equivalent iterative program, finally optimized using the polyhedral compiler Polly. We present preliminary results showing that this approach raises recursion optimization techniques to a higher level, in addition to widening the scope of the polyhedral model to include originally non-loop programs.
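
The core check behind such trace-based modeling can be illustrated simply (an NLR-flavoured sketch, not the actual algorithm, which recognizes arbitrarily nested loops): fit an affine function to the traced index sequence of a recursive function, and verify it over the whole trace before trusting it.

```python
def detect_affine(trace):
    """Does the traced index sequence follow an affine function
    i -> a*i + b? If so, return (a, b); otherwise return None."""
    if len(trace) < 2:
        return None
    a = trace[1] - trace[0]   # candidate slope from the first two samples
    b = trace[0]
    if all(trace[i] == a * i + b for i in range(len(trace))):
        return (a, b)
    return None
```

Once `(a, b)` is confirmed, the recursion's behavior can be re-expressed as the loop `for i in range(len(trace)): visit(a * i + b)`, an iterative form a polyhedral compiler can then transform.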

    Does dynamic and speculative parallelization enable advanced parallelizing and optimizing code transformations?

    Thread-Level Speculation (TLS) is a dynamic and automatic parallelization strategy that handles codes which cannot be parallelized at compile time because the information that can be extracted from the source code is insufficient. However, the proposed TLS systems are strongly limited in the kind of parallelization they can apply to the original sequential code. Consequently, they often yield poor performance. In this paper, we explain the main reasons for these limits and show that it is possible in some cases for a TLS system to handle more advanced parallelizing transformations. In particular, we show that codes characterized by phases in which the memory behavior can be modeled by linear functions can take advantage of a dynamic use of the polytope model.
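
Once a phase's accesses have been modeled as linear functions, a classical dependence test becomes applicable at runtime. As one example (the GCD test, a standard polyhedral-style test, not necessarily the one used in the paper): a write stream `a1*i + b1` and a read stream `a2*j + b2` can only conflict if `gcd(a1, a2)` divides `b2 - b1`.

```python
from math import gcd

def may_depend(a1, b1, a2, b2):
    """GCD dependence test on two linear access functions
    f(i) = a1*i + b1 (writes) and g(j) = a2*j + b2 (reads):
    an integer solution to a1*i - a2*j = b2 - b1 exists only if
    gcd(a1, a2) divides b2 - b1. False means the streams are
    provably independent and the phase can run in parallel."""
    return (b2 - b1) % gcd(a1, a2) == 0
```

The test is conservative in the other direction (it ignores loop bounds, so `True` only means a dependence *may* exist); a speculative system would then fall back to validation and rollback.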

    An Ancient House at ‘Amrah (Southern Syria), from Melchior de Vogüé to the Present Day

    A few kilometers from Shaqqa-Maximianopolis, in the small village of ‘Amrah, stand the well-preserved remains of a house probably dating from the 3rd century AD. Its layout is unusual compared with most Hauran houses and can be likened to that of Hellenistic-period dwellings. Several indications suggest that it was the residence of a leading citizen who entertained at home, much like the patroni of Italy. The house had already attracted the attention of Melchior de Vogüé, who published a plan and a section of it. The numerous errors they contain intrigued us, and consulting the explorer's archives made it possible to better understand how he worked in the field and his intellectual approach, which must be placed in the context of the 19th century and of the then-fashionable theories on heritage conservation.


    Adaptive Runtime Selection of Parallel Schedules in the Polytope Model

    There is often no unique version of a program that provides the best performance in all circumstances. Compilers should rely on an adaptive runtime decision to choose which optimizing and parallelizing transformations will lead to the best performance in any execution context. We present a new adaptive framework that addresses two drawbacks of existing methods: it is effective from the very first execution, and it handles slight variations in input data shape and size. In our proposal, different code versions of parallel loop nests are statically generated by the compiler. At install time, each version is profiled in different execution contexts. At runtime, the execution time of each code version is predicted using the profiling results, the current input data shape, and the number of available processor cores. The predicted best version is then run. Our framework handles several versions of possibly tiled parallel loops, using the polytope model for both the profiling and the dynamic selection phases. We show on several benchmark programs that our runtime system selects one of the most efficient versions with a very low runtime overhead. This quick and efficient selection leads to speedups compared with using a unique version in every execution context.
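
The selection step can be sketched as follows (an illustrative model, not the paper's actual predictor; the cost-model form `t = coeff * work / cores + overhead` and the profile values are assumptions):

```python
def pick_version(versions, profiles, work, cores):
    """Runtime selection among statically generated versions: `profiles`
    maps a version name to (coeff, overhead) fitted at install time; the
    version with the lowest predicted time for the current input size
    `work` and available `cores` is chosen and run."""
    def predicted(name):
        coeff, overhead = profiles[name]
        return coeff * work / cores + overhead
    return min(versions, key=predicted)
```

With profiles such as `{'plain': (1.0, 0.0), 'tiled': (0.5, 10.0)}`, the tiled version's fixed overhead makes the plain version win on small inputs, while the tiled version wins once the input is large enough, matching the paper's observation that no single version is best in all contexts.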

    Automatic Collapsing of Non-Rectangular Loops

    Loop collapsing is a well-known loop transformation that combines perfectly nested loops into one single loop. It makes it possible to exploit the whole amount of parallelism exhibited by the collapsed loops, and provides a perfect load balancing of iterations among the parallel threads. However, in current implementations of this loop optimization, such as the one in the OpenMP language, automatic loop collapsing is limited to loops with constant bounds that define rectangular iteration spaces, although load imbalance is a particularly crucial issue with non-rectangular loops. The OpenMP language addresses load balance mostly through dynamic runtime scheduling of the parallel threads. Nevertheless, this runtime scheduling introduces unavoidable execution-time overhead, while preventing exploitation of the entire parallelism of all the parallel loops. In this paper, we propose a technique to automatically collapse any perfectly nested loops defining non-rectangular iteration spaces whose bounds are linear functions of the loop iterators. Such spaces may be triangular, tetrahedral, trapezoidal, rhomboidal, or parallelepipedic. Our solution is based on original mathematical results addressing the inversion of a multivariate polynomial that defines a ranking of the integer points contained in a convex polyhedron. We show on a set of non-rectangular loop nests that our technique generates parallel OpenMP code that outperforms the original parallel loop nests, parallelized using either the "static" or "dynamic" option of the OpenMP schedule clause.
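
The simplest instance of such a ranking inversion can be shown concretely (a one-variable sketch; the paper's results cover general multivariate ranking polynomials): for the triangular domain {(i, j) : 0 <= j <= i}, the rank of (i, j) is the quadratic polynomial r = i*(i+1)/2 + j, and inverting it recovers (i, j) from the collapsed index.

```python
from math import isqrt

def unrank_triangular(r):
    """Invert the ranking polynomial r = i*(i+1)//2 + j of the triangular
    domain {(i, j) : 0 <= j <= i}: solving i*(i+1)/2 <= r for the largest
    integer i gives i = floor((sqrt(8r + 1) - 1) / 2), then j follows."""
    i = (isqrt(8 * r + 1) - 1) // 2
    j = r - i * (i + 1) // 2
    return i, j
```

A single collapsed loop `for r in range(n * (n + 1) // 2)` then distributes the triangle's iterations perfectly evenly across threads, with no partial chunks and no dynamic scheduler.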

    Handling Multi-Versioning in LLVM: Code Tracking and Cloning

    Instrumentation by sampling, adaptive computing, and dynamic optimization can be efficiently implemented using multiple versions of a code region. Ideally, compilers should automatically handle the generation of such multiple versions. In this work we discuss the problem of multi-versioning in the situation where each version requires a different intermediate representation. We expose the limits of today's compilers regarding these aspects and provide our solutions to overcome them, using the LLVM compiler as our research platform. The paper focuses on three main aspects: tracking code in LLVM IR, cloning, and communication between low-level and high-level representations. Aiming at performance and minimal impact on the behavior of the original code, we describe our strategies for guiding the interaction between the newly inserted code and the optimization passes, from annotating code using metadata to inlining assembly code in LLVM IR. Our target is performing code instrumentation and optimization, with an interest in loops. We build a version in x86_64 assembly code to acquire low-level information, and multiple versions in LLVM IR for performing high-level code transformations. The selection mechanism consists of callbacks to a generic runtime system. Preliminary results on the SPEC CPU 2006 and the Pointer-Intensive Benchmark suites show that our framework has a negligible overhead in most cases when instrumenting the most time-consuming loop nests.
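
The callback-based selection mechanism can be sketched in miniature (a hypothetical dispatcher, not the paper's LLVM implementation): each call to a multi-versioned region goes through a thin dispatcher that asks the runtime which version to execute, e.g. the instrumented build while sampling and the plain build afterwards.

```python
def make_dispatcher(versions, select):
    """Callback-based version selection: `versions` maps a name to a
    compiled code version; every call asks the runtime's `select`
    callback which version to run, so sampling policy lives entirely
    in the runtime system, not in the generated code."""
    def dispatch(*args):
        return versions[select()](*args)
    return dispatch
```

Because all versions share one entry point, switching between instrumentation and optimized execution costs only one indirect call per invocation, consistent with the low overheads the paper reports.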